STA 9750 Final-Project: A Tale of 59 NYC Community Districts

Author

Eduardo Alarcon

Introduction

The Overarching and Specific Questions

The COVID-19 pandemic established new urban norms, transforming job proximity from a strict requirement into a flexible option through remote work. This paradigm shift necessitates an analysis of how traditional value determinants adapted to these new dynamics. Our research team’s work addresses the overarching question (OQ):

Did COVID-19 reshape the relationship between neighborhood characteristics and property values across NYC’s Community Districts (CDs)?

While the team’s broader effort examines crime, density, job accessibility, and transit, this analysis focuses on Educational Attainment (EA). Historically, high-education neighborhoods commanded significant price premiums through well-known agglomeration effects. The pandemic, however, disrupted these patterns, motivating this research’s specific question (SQ):

Did the strength of the relationship between neighborhood educational attainment and property values change post-COVID, and did this change differ across NYC’s Community Districts?

Hypotheses

Given the pandemic’s impact on work-life demands and its redefinition of dwelling needs, this question motivates two hypotheses.

Hypothesis 1: The positive correlation between education and property values would strengthen post-COVID.

Hypothesis 2: High-education CDs would experience stronger property value growth post-COVID as remote work freed professionals to prioritize neighborhood amenities over commute times.

Beyond testing these hypotheses, this analysis also investigates whether pandemic-era shifts represent temporary disruption or a systemic realignment of urban real estate economics.

Data Acquisition and Processing

This analysis integrates four data sources through programmatic acquisition, requiring careful transformation to align incompatible geographic coding systems across federal and city databases.

Setup and Configuration

Loading the required packages and defining global constants lays the groundwork for programmatically assembling a panel of all 59 CDs for 2017–2019 vs. 2021–2023 and merging it with baseline 2019 American Community Survey (ACS) education variables.

Show code
#| label: setup-and-helper-files
#| include: false
#| message: false
#| warning: false

# Core packages
library(dplyr)
library(ggplot2)
library(glue)
library(janitor)
library(lubridate)
library(purrr)
library(readr)
library(readxl)
library(scales)
library(sf)
library(stringr)
library(tibble)
library(tidycensus)
library(tidyr)
library(utils)

# Reporting / inference
library(broom)   # tidy(), glance()
library(knitr)   # kable tables
library(infer)   # t_test
library(stats)   # base statistical functions (attached by default)

options(dplyr.summarise.inform = FALSE)

# NYC counties for ACS (Bronx, Brooklyn, Manhattan, Queens, Staten Island)
NYC_COUNTIES <- c("005", "047", "061", "081", "085")

# ACS B15003 variables we'll actually use
ACS_EDUCATION_VARS <- c(
  "B15003_001",  # Total population 25+
  "B15003_022",  # BA
  "B15003_023",  # MA
  "B15003_024",  # Professional
  "B15003_025"   # Doctorate
)

# Period definitions for pre/post COVID
PERIOD_DEFINITIONS <- list(
  pre_covid = list(
    acs_end_year = 2019,          # ACS 2015–2019 (baseline)
    sales_years  = 2017:2019,
    label        = "Pre-COVID (2017–2019)"
  ),
  post_covid = list(
    acs_end_year = 2019,          # SAME baseline ACS 2015–2019
    sales_years  = 2021:2023,
    label        = "Post-COVID (2021–2023)"
  )
)

# Labels for period variables used in tables/plots
PERIOD_LABELS <- c(
  pre_covid  = "Pre-COVID",
  post_covid = "Post-COVID"
)

clean_period <- function(x) {
  recode(x,
    !!!PERIOD_LABELS,
    .default = as.character(x)
  )
}


NYC Community District Shapefile

The get_nyc_cd() function retrieves the official shapefile from the NYC Department of City Planning, unzips it, and constructs standardized identifiers for all 59 Community Districts.

Show code
#| label: get-nyc-cd
#| include: false

get_nyc_cd <- function() {
  cd_url   <- "https://s-media.nyc.gov/agencies/dcp/assets/files/zip/data-tools/bytes/community-districts/nycd_25c.zip"
  zip_path <- "data_raw/nycd_25c.zip"
  dir_out  <- "data_raw/nycd_25c"
  rds_path <- "data_clean/nycd_25c.rds"
  
  # Return cached version if it exists
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))
  }
  
  # Ensure directories exist
  if (!dir.exists("data_raw"))   dir.create("data_raw")
  if (!dir.exists("data_clean")) dir.create("data_clean")
  
  # Download shapefile zip if needed
  if (!file.exists(zip_path)) {
    download.file(cd_url, destfile = zip_path, mode = "wb")
  }
  
  # Unzip if needed
  if (!dir.exists(dir_out) || length(list.files(dir_out, recursive = TRUE)) == 0) {
    unzip(zip_path, exdir = dir_out)
  }
  
  # Find the actual .shp file inside the unzipped directory
  shp_path <- list.files(
    dir_out,
    pattern    = "\\.shp$",
    full.names = TRUE,
    recursive  = TRUE
  )[1]
  
  # Read shapefile, enforce CRS, and build CD IDs
  cd_sf <- st_read(shp_path, quiet = TRUE) |>
    st_transform("EPSG:2263") |>  # explicit CRS: NAD83 / NY Long Island (ftUS)
    mutate(
      boro_cd  = as.integer(BoroCD),                    # e.g., 101, 207, 503
      boro_num = substr(sprintf("%03d", boro_cd), 1, 1),
      boro_abbr = case_when(
        boro_num == "1" ~ "MN",
        boro_num == "2" ~ "BX",
        boro_num == "3" ~ "BK",
        boro_num == "4" ~ "QN",
        boro_num == "5" ~ "SI",
        TRUE            ~ NA_character_
      ),
      cd_num = sprintf("%02d", boro_cd %% 100),         # 01–18
      cd_id  = paste0(boro_abbr, cd_num)                # MN01, BK03, etc.
    ) |>
    # keep only the 59 standard CDs, drop Joint Interest Areas
    filter(as.integer(cd_num) <= 18)
  
    # Check for duplicate CD IDs
  dup_ids <- cd_sf |>
    count(cd_id) |>
    filter(n > 1)

  if (nrow(dup_ids) > 0) {
    stop("Duplicate cd_id values found: ", paste(dup_ids$cd_id, collapse = ", "))
  }

  # Validate that we have exactly 59 CDs
  if (nrow(cd_sf) != 59) {
    stop("Expected 59 Community Districts but got ", nrow(cd_sf))
  }

  message("Successfully loaded 59 Community Districts")
  

  saveRDS(cd_sf, rds_path)
  cd_sf
}

PLUTO BBL-to-CD Crosswalk

The get_pluto_cd_crosswalk() function builds a Borough–Block–Lot (BBL)-to-CD crosswalk from raw PLUTO data on NYC Open Data. The process normalizes approximately 870,000 BBLs, resolving formatting inconsistencies to create standardized linkages between individual tax lots and CDs.
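The normalization rule can be illustrated on a few hypothetical BBL strings (the real function operates on the full PLUTO extract): a valid BBL is 10 characters — a 1-digit borough code, a zero-padded 5-digit block, and a zero-padded 4-digit lot — and any trailing ".0" left over from numeric export is stripped:

```r
library(stringr)

# Hypothetical BBL values, not real tax lots
raw_bbls <- c("1000160001.0", "3012340056", "4056780123.00")

# Strip trailing ".0"/".00"; the crosswalk then keeps only 10-character BBLs
clean_bbls <- str_replace(raw_bbls, "\\.0+$", "")
clean_bbls[nchar(clean_bbls) == 10]
```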

Show code
get_pluto_cd_crosswalk <- function() {
  rds_path <- "data_clean/pluto_cd_crosswalk.rds"

  if (file.exists(rds_path)) {
    return(readRDS(rds_path))
  }

  if (!dir.exists("data_clean")) dir.create("data_clean")
  if (!dir.exists("data_raw"))   dir.create("data_raw")

  # NYC Open Data API for PLUTO (only bbl and cd columns)
  pluto_url <- paste0(
    "https://data.cityofnewyork.us/resource/64uk-42ks.csv?",
    "$select=bbl,cd&$limit=5000000"
  )

  message("Downloading PLUTO crosswalk from NYC Open Data...")
  pluto_raw <- read_csv(
    pluto_url,
    col_types = cols(
      bbl = col_character(),
      cd  = col_character()
    )
  )

  message("  Original PLUTO rows: ", nrow(pluto_raw))

  pluto_xwalk <- pluto_raw |>
    filter(!is.na(cd), cd != "0", cd != "99") |>
    mutate(
      # Strip trailing ".0", ".00", etc. if present
      bbl = str_replace(bbl, "\\.0+$", ""),
      bbl_length = nchar(bbl),
      boro_cd    = as.integer(cd)
    ) |>
    # Keep only clean, 10-character BBLs
    filter(bbl_length == 10) |>
    select(bbl, boro_cd)

  message("  After filtering invalid cd / BBL length: ", nrow(pluto_xwalk))
  message("  Removed rows with invalid cd or bbl_length != 10")

  # Diagnostic: check BBL format consistency
  bbl_lengths <- table(nchar(pluto_xwalk$bbl))
  if (length(bbl_lengths) > 1 || names(bbl_lengths)[1] != "10") {
    warning("BBL length inconsistency detected. All BBLs should be 10 characters.")
    print(bbl_lengths)
  }

  # Lookup table from CD shapefile
  cd_lookup <- get_nyc_cd() |>
    st_drop_geometry() |>
    select(boro_cd, cd_id) |>
    distinct()

  crosswalk <- pluto_xwalk |>
    left_join(cd_lookup, by = "boro_cd") |>
    filter(!is.na(cd_id)) |>
    distinct(bbl, cd_id, boro_cd)

  message("  Crosswalk built with ", nrow(crosswalk), " rows.")
  message("  CDs represented: ", length(unique(crosswalk$cd_id)))

  saveRDS(crosswalk, rds_path)
  crosswalk
}

This process removed 1,154 invalid entries (0.1%), preventing silent join failures downstream, as shown in Table 1 below.

Show code
#| label: pluto-diagnostics
#| echo: false
#| message: false
#| warning: false

# Get diagnostics from function
pluto_raw <- read_csv(
  "https://data.cityofnewyork.us/resource/64uk-42ks.csv?$select=bbl,cd&$limit=5000000",
  col_types = cols(bbl = col_character(), cd = col_character()),
  show_col_types = FALSE
)

pluto_clean <- pluto_raw |>
  filter(!is.na(cd), cd != "0", cd != "99") |>
  mutate(bbl = str_replace(bbl, "\\.0+$", "")) |>
  filter(nchar(bbl) == 10)

# Create summary table
tibble(
  Stage = c("Raw PLUTO records", "After BBL normalization", "Records removed"),
  Count = c(
    comma(nrow(pluto_raw)),
    comma(nrow(pluto_clean)),
    comma(nrow(pluto_raw) - nrow(pluto_clean))
  ),
  Percentage = c(
    "100%",
    paste0(round(100 * nrow(pluto_clean) / nrow(pluto_raw), 1), "%"),
    paste0(round(100 * (nrow(pluto_raw) - nrow(pluto_clean)) / nrow(pluto_raw), 1), "%")
  )
) |>
  kable(caption = "**Table 1: PLUTO BBL Normalization Results**")
Table 1: PLUTO BBL Normalization Results

Stage                     Count     Percentage
Raw PLUTO records         858,284   100%
After BBL normalization   857,130   99.9%
Records removed           1,154     0.1%

Department of Finance Rolling Sales Data

The get_dof_sales_year_boro() function automates the collection and cleaning of Annualized Sales reports from the NYC Department of Finance (DOF). Quality filters then remove transactions under $10,000 and restrict the data to residential tax classes (1, 2, 2A, 2B, and 2C).

Show code
# Chunk 4: DOF 

get_dof_sales_year_boro <- function(year, borough_name) {
  # 1. Validate inputs  
  valid_boroughs <- c("manhattan", "bronx", "brooklyn", "queens", "staten_island")
  if (!borough_name %in% valid_boroughs) {
    stop(
      "Invalid `borough_name`: ", borough_name,
      ". Must be one of: ", paste(valid_boroughs, collapse = ", ")
    )
  }
  
  # 2. URLs  
  base_url <- "https://www.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales"
  ext <- if (year == 2017) "xls" else "xlsx"
  
  slug_primary <- if (borough_name == "staten_island") "staten_island" else borough_name
  slug_alt     <- if (borough_name == "staten_island") "statenisland" else borough_name
  
  url_primary <- glue("{base_url}/{year}/{year}_{slug_primary}.{ext}")
  url_alt     <- if (slug_alt != slug_primary) {
    glue("{base_url}/{year}/{year}_{slug_alt}.{ext}")
  } else {
    NA_character_
  }
  
  # 3. Check cache  
  cache_file <- glue("data_clean/dof_{year}_{borough_name}.rds")
  
  if (file.exists(cache_file)) {
    message("  ✓ Using cached: ", year, " ", borough_name)
    return(readRDS(cache_file))
  }
  
  # 4. Download  
  if (!dir.exists("data_raw")) dir.create("data_raw")
  file_path <- glue("data_raw/dof_sales_{year}_{borough_name}.{ext}")
  
  if (!file.exists(file_path)) {
    message("  Downloading: ", year, " ", borough_name, " ...")
    
    download_ok <- FALSE
    
    try({
      download.file(url_primary, file_path, mode = "wb", quiet = TRUE)
      download_ok <- TRUE
    }, silent = TRUE)
    
    if (!download_ok && !is.na(url_alt)) {
      message("    Trying alternate URL...")
      try({
        download.file(url_alt, file_path, mode = "wb", quiet = TRUE)
        download_ok <- TRUE
      }, silent = TRUE)
    }
    
    if (!download_ok) {
      warning("Failed to download: ", year, " ", borough_name)
      return(NULL)
    }
  }
  
  # 5. Detect header 
  preview <- read_excel(file_path, col_names = FALSE, n_max = 20)
  
  header_row <- which(apply(preview, 1, function(x) any(grepl("BOROUGH", x, ignore.case = TRUE))))
  
  if (length(header_row) == 0) {
    warning("Could not detect header for ", year, " ", borough_name)
    return(NULL)
  }
  
  skip_rows <- header_row[1] - 1
  
  # 6. Read file, with readxl support to determine column types  
  message("  Reading: ", year, " ", borough_name)
  raw_sales <- suppressWarnings(
    read_excel(
      file_path, 
      skip = skip_rows
    )
  ) |>
    clean_names()
  
  message("    Rows: ", nrow(raw_sales))
  
  if (nrow(raw_sales) == 0L) {
    warning("No data rows for ", year, " ", borough_name)
    return(NULL)
  }
  
  # 7. Check required columns 
  required_cols <- c("borough", "block", "lot", "sale_price", "sale_date")
  missing_cols  <- setdiff(required_cols, names(raw_sales))
  
  if (length(missing_cols) > 0) {
    warning("Missing columns: ", paste(missing_cols, collapse = ", "))
    return(NULL)
  }
  
  # 8. Convert and standardize
  sales_clean <- raw_sales |>
    mutate(
      # Basic numeric conversions
      block = suppressWarnings(as.integer(block)),
      lot   = suppressWarnings(as.integer(lot)),
      sale_price = suppressWarnings(as.numeric(sale_price)),
      gross_square_feet = suppressWarnings(as.numeric(gross_square_feet)),
      
      # Date conversion to handle numeric Excel dates and text dates
      sale_date = suppressWarnings(case_when(
        inherits(sale_date, c("Date", "POSIXct", "POSIXt")) ~ as.Date(sale_date),
        is.numeric(sale_date) ~ as.Date(sale_date, origin = "1899-12-30"),
        is.character(sale_date) ~ mdy(sale_date),
        TRUE ~ as.Date(NA)
      )),

      # Borough code
      borough_code = case_when(
        borough_name == "manhattan"     ~ "1",
        borough_name == "bronx"         ~ "2",
        borough_name == "brooklyn"      ~ "3",
        borough_name == "queens"        ~ "4",
        borough_name == "staten_island" ~ "5"
      ),
      
      # Construct BBL
      bbl = paste0(
        borough_code,
        str_pad(block, 5, pad = "0"),
        str_pad(lot, 4, pad = "0")
      )
    )
  
  # 9. Find and standardize tax_class
  tax_class_candidates <- c(
    "tax_class_at_time_of_sale",
    "tax_class_at_present",
    "tax_class_at_time_of_sale_2",
    "tax_class"
  )
  
  tax_class_col <- tax_class_candidates[tax_class_candidates %in% names(sales_clean)][1]
  
  if (is.na(tax_class_col)) {
    warning("No tax_class column found for ", year, " ", borough_name)
    return(NULL)
  }
  
  sales_clean <- sales_clean |>
    mutate(
      tax_class = str_trim(as.character(.data[[tax_class_col]]))
    )
  
  # 10. Apply filters 
  sales_clean <- sales_clean |>
    filter(
      !is.na(bbl),
      !is.na(sale_price),
      sale_price >= 10000,
      !is.na(sale_date),
      !is.na(tax_class),
      tax_class %in% c("1", "2", "2A", "2B", "2C")
    ) |>
    select(
      bbl,
      sale_price,
      sale_date,
      gross_square_feet,
      tax_class,
      borough_code,
      block,
      lot
    )

  n_before_filter <- nrow(raw_sales)
  n_after_filter  <- nrow(sales_clean)
  n_removed       <- n_before_filter - n_after_filter

  if (n_removed > 0) {
    message(
      "    Filtered out: ", comma(n_removed), " rows (",
      round(100 * n_removed / n_before_filter, 1), "%)"
    )
  }
  
  # 11. Cache 
  if (!dir.exists("data_clean")) dir.create("data_clean")
  saveRDS(sales_clean, cache_file)
  
  sales_clean
}

Enhanced BBL Matching: Two-Stage Approach


A two-stage join strategy combines exact BBL matching with a block-level fallback. Because standard exact-match joins lose approximately 25% of transactions to condominium billing BBLs, the fallback assigns each remaining unmatched sale to the CD of its city block. This yields a 100% match rate, minimizing gaps in high-density areas, as shown in Table 2.
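The two-stage logic can be sketched with toy data (hypothetical BBLs and CD assignments; the actual implementation uses the full PLUTO crosswalk):

```r
library(dplyr)

sales <- tibble(bbl = c("1000010001", "1000010002"), sale_price = c(5e5, 7e5))
xwalk <- tibble(bbl = "1000010001", block = 1L, cd_id = "MN01")

# Stage 1: exact BBL match
exact <- inner_join(sales, xwalk, by = "bbl")

# Stage 2: unmatched sales (e.g., condo billing BBLs) fall back to their block's CD
fallback <- sales |>
  anti_join(xwalk, by = "bbl") |>
  mutate(block = as.integer(substr(bbl, 2, 6))) |>
  inner_join(distinct(xwalk, block, cd_id), by = "block")

matched <- bind_rows(exact, fallback)  # every sale now carries a cd_id
```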

Show code
#| label: define-cd-sales-panel
#| include: false

# Build CD-Level Sales Panel from DOF + PLUTO 

build_cd_sales_panel <- function(
  pre_years   = c(2017, 2018, 2019),
  post_years  = c(2021, 2022, 2023),
  boroughs    = c("bronx", "brooklyn", "manhattan", "queens", "staten_island"),
  pluto_xwalk = NULL
) {
  # 0. Dependencies / inputs 
  if (is.null(pluto_xwalk)) {
    pluto_xwalk <- get_pluto_cd_crosswalk()
  }

  # 1. Year/borough grid 
  combos <- expand_grid(
    year    = c(pre_years, post_years),
    borough = boroughs
  ) |>
    arrange(year, borough)

  message("=== Building CD sales panel ===")
  message("Pre-COVID years:  ", paste(pre_years,  collapse = ", "))
  message("Post-COVID years: ", paste(post_years, collapse = ", "))

  sales_list        <- vector("list", nrow(combos))
  match_summary_vec <- vector("list", nrow(combos))

  # 2. Loop over year × borough combos  
  for (i in seq_len(nrow(combos))) {
    yr <- combos$year[i]
    bo <- combos$borough[i]

    message("  [", i, "/", nrow(combos), "] Processing year ", yr, " – ", bo)

    sales <- get_dof_sales_year_boro(yr, bo)

    if (is.null(sales) || nrow(sales) == 0L) {
      warning("    No usable rows for ", yr, " ", bo)

      match_summary_vec[[i]] <- tibble(
        year       = yr,
        borough    = bo,
        n_total    = 0L,
        n_matched  = 0L,
        match_rate = NA_real_
      )
      next
    }

    # 2a. Match stats vs. PLUTO crosswalk (per-file diagnostic)
    n_total   <- nrow(sales)
    matched   <- inner_join(sales, pluto_xwalk, by = "bbl")
    n_matched <- nrow(matched)
    match_rate <- if (n_total > 0L) n_matched / n_total else NA_real_

    message(
      sprintf(
        "    n_total = %s, n_matched = %s (%.1f%%)",
        comma(n_total),
        comma(n_matched),
        100 * match_rate
      )
    )

    sales_list[[i]] <- sales |>
      mutate(
        year    = yr,
        borough = bo
      )

    match_summary_vec[[i]] <- tibble(
      year       = yr,
      borough    = bo,
      n_total    = n_total,
      n_matched  = n_matched,
      match_rate = match_rate
    )
  }

  # 3. Bind all years/boroughs together  
  all_sales <- bind_rows(sales_list)
  match_summary_all <- bind_rows(match_summary_vec)

  message("")
  message("=== Per-File Match Summary ===")
  # print(match_summary_all |> arrange(year, borough), n = Inf)
  
  assign("sales_match_summary", match_summary_all, envir = .GlobalEnv)

  # 4. Collapse to CD–year panel with Enhanced Matching  
  
  # 4a. Exact BBL match
  sales_matched_exact <- all_sales |>
    inner_join(pluto_xwalk, by = "bbl")
  
  n_exact <- nrow(sales_matched_exact)
  
  # 4b. For unmatched, try block-level matching (for condos)
  sales_unmatched <- all_sales |>
    anti_join(pluto_xwalk, by = "bbl")
  
  if (nrow(sales_unmatched) > 0) {
    message("")
    message("=== Attempting Block-Level Matching ===")
    message("Unmatched sales: ", comma(nrow(sales_unmatched)))
    
    # Create block-level lookup from PLUTO
    # For each block, find the most common CD (handles edge cases)
    pluto_block_lookup <- pluto_xwalk |>
      mutate(
        block = as.integer(substr(bbl, 2, 6)),
        boro_digit = substr(bbl, 1, 1)
      ) |>
      count(boro_digit, block, cd_id, boro_cd) |>
      group_by(boro_digit, block) |>
      slice_max(n, n = 1, with_ties = FALSE) |>
      ungroup() |>
      select(boro_digit, block, cd_id, boro_cd)
    
    # Match unmatched sales by block
    sales_matched_block <- sales_unmatched |>
      mutate(boro_digit = substr(bbl, 1, 1)) |>
      inner_join(
        pluto_block_lookup,
        by = c("boro_digit", "block")
      ) |>
      select(-boro_digit)
    
    n_block <- nrow(sales_matched_block)
    
    message("Block-matched: ", comma(n_block))
    message("Still unmatched: ", comma(nrow(sales_unmatched) - n_block))
    
    # Combine exact and block matches
    sales_with_cd <- bind_rows(
      sales_matched_exact,
      sales_matched_block
    )
  } else {
    sales_with_cd <- sales_matched_exact
    n_block <- 0
  }
  
  # 4c. Aggregate to CD-year level
  cd_panel <- sales_with_cd |>
    mutate(
      period = case_when(
        year %in% pre_years  ~ "pre_covid",
        year %in% post_years ~ "post_covid",
        TRUE                 ~ NA_character_
      )
    ) |>
    filter(!is.na(period)) |>
    group_by(cd_id, boro_cd, year, period) |>
    summarise(
      n_sales          = n(),
      median_price     = median(sale_price, na.rm = TRUE),
      mean_price       = mean(sale_price, na.rm = TRUE),
      sd_price         = sd(sale_price, na.rm = TRUE),
      total_sales_vol  = sum(sale_price, na.rm = TRUE),
      median_gsf       = median(gross_square_feet, na.rm = TRUE),
      n_gsf_nonmissing = sum(!is.na(gross_square_feet)),
      .groups = "drop"
    ) |>
    arrange(year, cd_id)

  # 5. Overall match statistics  
  total_sales_all_files <- nrow(all_sales)
  total_sales_matched   <- n_exact + n_block
  overall_match_rate <- total_sales_matched / total_sales_all_files

  message("")
  message("=== Overall Match Statistics ===")
  message("Total sales (all files):     ", comma(total_sales_all_files))
  message("Exact BBL matches:           ", comma(n_exact))
  message("Block-level matches:         ", comma(n_block))
  message("Sales matched to CDs:        ", comma(total_sales_matched))
  message("Sales NOT matched:           ",
          comma(total_sales_all_files - total_sales_matched))
  message("Overall match rate:          ",
          round(100 * overall_match_rate, 1), "%")

  if (overall_match_rate < 0.75) {
    warning("Overall match rate is below 75% - review BBL construction logic")
  }

  # 6. Validate CD coverage  
  cd_sf <- get_nyc_cd()
  expected_cds <- unique(cd_sf$cd_id)
  actual_cds   <- unique(cd_panel$cd_id)
  missing_cds  <- setdiff(expected_cds, actual_cds)

  if (length(missing_cds) > 0) {
    warning("Missing CDs in final panel: ",
            paste(missing_cds, collapse = ", "))
    message("  These CDs may have no matching sales in the selected years")
  }

  # 7. Flag low-volume CD-years  
  low_volume <- cd_panel |>
    filter(n_sales < 50)

  if (nrow(low_volume) > 0) {
    message("")
    message("Warning: ", nrow(low_volume),
            " CD-year combinations have fewer than 50 sales:")
    print(
      low_volume |>
        select(cd_id, year, period, n_sales) |>
        arrange(n_sales)
    )
  }

  # 8. Final summary  
  message("")
  message("=== Final CD Sales Panel ===")
  message("CD sales panel built with ", nrow(cd_panel), " rows.")
  message("Distinct CDs: ", n_distinct(cd_panel$cd_id))
  message("Years: ", paste(sort(unique(cd_panel$year)), collapse = ", "))
  message("Periods: ", paste(sort(unique(cd_panel$period)), collapse = ", "))

  cd_panel
}
Show code
#| label: matching-diagnostics
#| echo: false
#| message: false
#| warning: false

# Run the function (it creates cd_sales_panel and assigns sales_match_summary to global env)
cd_sales_panel <- build_cd_sales_panel(
  pre_years = PERIOD_DEFINITIONS$pre_covid$sales_years,
  post_years = PERIOD_DEFINITIONS$post_covid$sales_years
)

# Calculate Stage 1: Exact matches (from sales_match_summary)
stage1_total <- sum(sales_match_summary$n_total, na.rm = TRUE)
stage1_matched <- sum(sales_match_summary$n_matched, na.rm = TRUE)

# Calculate Stage 2: Total sales that made it into the panel
total_in_panel <- sum(cd_sales_panel$n_sales, na.rm = TRUE)

# Stage 2 additions = sales in panel that weren't exact matches
stage2_additions <- total_in_panel - stage1_matched

# Calculate match rate percentage
match_rate_pct <- round(100 * total_in_panel / stage1_total, 1)

# Create summary table
tibble(
  Stage = c(
    "Total sales (all files)",
    "Stage 1: Exact BBL matches",
    "Stage 2: Block-level matches",
    "Total matched to CDs",
    "Overall match rate"
  ),
  Count = c(
    comma(stage1_total),
    comma(stage1_matched),
    comma(stage2_additions),
    comma(total_in_panel),
    paste0(match_rate_pct, "%")
  ),
  Percentage = c(
    "100%",
    paste0(round(100 * stage1_matched / stage1_total, 1), "%"),
    paste0(round(100 * stage2_additions / stage1_total, 1), "%"),
    paste0(round(100 * total_in_panel / stage1_total, 1), "%"),
    ""
  )
) |>
  kable(
    caption = "**Table 2: Enhanced BBL Matching Results: Two-Stage Approach**",
    align = c("l", "l", "l")
  )
Table 2: Enhanced BBL Matching Results: Two-Stage Approach

Stage                          Count     Percentage
Total sales (all files)        339,826   100%
Stage 1: Exact BBL matches     246,392   72.5%
Stage 2: Block-level matches   93,434    27.5%
Total matched to CDs           339,826   100%
Overall match rate             100%

Note: The code block below handles all data acquisition and pre-processing to establish the foundational datasets for this research. These background processes include structural validations to ensure data integrity before thorough analysis.

Show code
#| label: load-core-data
#| include: false
#| message: false
#| warning: false

# Load core spatial pieces
nyc_cd <- get_nyc_cd()
pluto_xwalk <- get_pluto_cd_crosswalk()

# Build the CD-level sales panel (suppress all output)
cd_sales_panel <- build_cd_sales_panel(
  pre_years = PERIOD_DEFINITIONS$pre_covid$sales_years,
  post_years = PERIOD_DEFINITIONS$post_covid$sales_years,
  pluto_xwalk = pluto_xwalk
)

# Structural validation (silent)
stopifnot(
  n_distinct(cd_sales_panel$cd_id) == 59L,
  n_distinct(cd_sales_panel$year) == 6L,
  nrow(cd_sales_panel) == 59L * 6L
)

American Community Survey Education Data

Sourced from Table B15003, the American Community Survey (ACS) dataset aggregates educational variables from approximately 2,200 census tracts whose boundaries do not align with NYC CD boundaries. This analysis therefore uses an area-weighted aggregation, a geographic method that redistributes tract-level counts to CDs in proportion to overlapping area, so that 100% of the population is accounted for.

\[\text{Total}_{\text{Pop BA+}} = \sum_{\text{tracts}} \left( \text{Tract}_{\text{Pop BA+}} \times \frac{\text{Intersection Area}}{\text{Tract Area}} \right)\]
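As a worked toy example of this formula (hypothetical figures): a tract with 1,000 BA+ residents whose area is split 60/40 between two CDs contributes 600 and 400 of them, respectively:

```r
# Hypothetical tract split across two CDs by intersection-area share
tract_ba_plus <- 1000
area_weights  <- c(MN01 = 0.6, MN02 = 0.4)  # intersection area / tract area

allocation <- tract_ba_plus * area_weights
allocation       # MN01 gets 600, MN02 gets 400
sum(allocation)  # equals the tract total, so no population is lost or double-counted
```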

Show code
#| label: define-acs-cd-education
#| include: false

# Get ACS Education Data by Community District (area-weighted)
get_acs_cd_education <- function(end_year,
                                 cd_sf        = get_nyc_cd(),
                                 counties     = NYC_COUNTIES,
                                 period_label = NULL) {
  
  # 0. Cache check
  cache_file <- glue("data_clean/acs_education_{end_year}.rds")
  
  if (file.exists(cache_file)) {
    message("✓ Using cached ACS education data: ", end_year)
    
    # Read BASELINE ACS data (no period in the cache)
    base <- readRDS(cache_file)
    
    # Attach period label ONLY on return
    if (!is.null(period_label)) {
      base <- base |>
        mutate(period = period_label)
    }
    
    return(base)
  }
  
  message("=== Downloading ACS education data (", end_year, " 5-year) ===")
  
  # 1. Pull tract-level ACS data  
  acs_tracts <- suppressMessages(
    get_acs(
      geography   = "tract",
      variables   = ACS_EDUCATION_VARS,
      year        = end_year,
      survey      = "acs5",
      state       = "NY",
      county      = counties,
      geometry    = TRUE,
      cache_table = TRUE,
      output      = "wide"
    )
  ) |>
    clean_names() |>
    transmute(
      geoid,
      total_pop_25plus = b15003_001e,
      ba_plus          = b15003_022e + b15003_023e +
                         b15003_024e + b15003_025e,
      geometry
    ) |>
    filter(!is.na(total_pop_25plus), total_pop_25plus > 0)
  
  message("  Downloaded ", nrow(acs_tracts), " census tracts")
  
  # 2. Make geometries valid and align CRS  
  acs_tracts <- st_make_valid(acs_tracts)
  cd_sf      <- st_make_valid(cd_sf)
  cd_sf      <- cd_sf |> st_transform(st_crs(acs_tracts))
  
  # 3. Area-weighted intersection  
  message("  Computing area-weighted intersections...")
  
  inter <- suppressWarnings(
    st_intersection(
      acs_tracts,
      cd_sf |> select(cd_id)
    )
  )
  
  inter <- inter |>
    mutate(
      inter_area = as.numeric(st_area(geometry))
    )
  
  tract_areas <- acs_tracts |>
    transmute(
      geoid,
      tract_area = as.numeric(st_area(geometry))
    ) |>
    st_drop_geometry()
  
  inter <- inter |>
    st_drop_geometry() |>
    left_join(tract_areas, by = "geoid") |>
    mutate(
      weight = if_else(tract_area > 0, inter_area / tract_area, 0)
    )
  
  # 4. Apply weights and aggregate to CD level  
  cd_edu <- inter |>
    mutate(
      wt_total_pop_25plus = total_pop_25plus * weight,
      wt_ba_plus          = ba_plus * weight
    ) |>
    group_by(cd_id) |>
    summarise(
      total_pop_25plus = sum(wt_total_pop_25plus, na.rm = TRUE),
      ba_plus          = sum(wt_ba_plus, na.rm = TRUE),
      n_tracts         = n(),
      .groups          = "drop"
    ) |>
    mutate(
      pct_ba_plus_25plus = if_else(
        total_pop_25plus > 0,
        (ba_plus / total_pop_25plus) * 100,
        NA_real_
      ),
      acs_end_year = end_year
    )
  
  # 5. Backwards-compatible names for the rest of your pipeline  
  cd_edu <- cd_edu |>
    mutate(
      pct_ba_plus = pct_ba_plus_25plus,
      acs_year    = acs_end_year
    ) |>
    arrange(cd_id)
  
  # 6. Validate and report  
  n_cds <- n_distinct(cd_edu$cd_id)
  if (n_cds != 59) {
    warning("Expected 59 CDs, found ", n_cds)
  }
  
  message("  Aggregated to ", n_cds, " Community Districts")
  message(
    "  Education range: ",
    round(min(cd_edu$pct_ba_plus, na.rm = TRUE), 1), "% to ",
    round(max(cd_edu$pct_ba_plus, na.rm = TRUE), 1), "%"
  )
  
  # 7. Cache BASELINE (no period)  
  if (!dir.exists("data_clean")) dir.create("data_clean")
  saveRDS(cd_edu, cache_file)
  message("  ✓ Cached to: ", cache_file, "\n")
  
  # 8. Attach period label to the returned object  
  if (!is.null(period_label)) {
    cd_edu <- cd_edu |>
      mutate(period = period_label)
  }
  
  cd_edu
}

# Wrapper to build the pre/post panel using same baseline ACS  
build_cd_education_panel <- function() {
  message("=== Building CD Education Panel ===")
  
  pre <- get_acs_cd_education(
    end_year     = PERIOD_DEFINITIONS$pre_covid$acs_end_year,  # 2019
    period_label = "pre_covid"
  )
  
  post <- get_acs_cd_education(
    end_year     = PERIOD_DEFINITIONS$post_covid$acs_end_year, # 2019
    period_label = "post_covid"
  )
  
  panel <- bind_rows(pre, post)
  
  # Validate expected structure: 59 CDs × 2 periods
  stopifnot(
    "Expected 118 rows (59 CDs × 2 periods)" = nrow(panel) == 118,
    "Missing pre_covid CDs"                 = sum(panel$period == "pre_covid") == 59,
    "Missing post_covid CDs"                = sum(panel$period == "post_covid") == 59
  )
  
  message("=== Education Panel Complete ===")
  message("Total rows: ", nrow(panel))
  message("Expected: 118 (59 CDs × 2 periods)\n")
  
  panel
}
Show code
#| label: build-education-panel
#| include: false
#| message: false
#| warning: false

cd_education_panel <- build_cd_education_panel()

# Structural validation (silent)
stopifnot(
  n_distinct(cd_education_panel$cd_id)  == 59L,
  n_distinct(cd_education_panel$period) == 2L,
  nrow(cd_education_panel)              == 118L
)
Show code
#| label: cd-edu-summary
#| echo: false
#| message: false
#| warning: false

# 1. Guess which column in cd_education_panel is BA+ (in % or proportion)
ba_candidates <- names(cd_education_panel)[
  grepl("ba",  names(cd_education_panel), ignore.case = TRUE) |
  grepl("bach", names(cd_education_panel), ignore.case = TRUE)
]

if (length(ba_candidates) == 0) {
  stop(
    "Could not find a BA+ column in cd_education_panel.\n",
    "Please set ba_col manually to the correct column name."
  )
}

ba_col <- ba_candidates[1]
# If you know the correct name, you can override:
# ba_col <- "YOUR_COLUMN_NAME_HERE"

# 2. Standardize period labels to a simple Pre/Post flag
cd_edu_tmp <- cd_education_panel |>
  mutate(
    period2 = case_when(
      period %in% c("pre_covid", "Pre-COVID") ~ "Pre",
      period %in% c("post_covid", "Post-COVID") ~ "Post",
      TRUE ~ NA_character_
    )
  ) |>
  filter(period2 == "Pre")

# 3. Collapse to one row per CD with baseline BA+ value
cd_edu_summary <- cd_edu_tmp |>
  group_by(cd_id) |>
  summarise(
    ba_raw = first(.data[[ba_col]]),
    .groups = "drop"
  )

# 4. Convert to percent if necessary (0–1 → 0–100)
max_ba <- max(cd_edu_summary$ba_raw, na.rm = TRUE)

cd_edu_summary <- cd_edu_summary |>
  mutate(
    pct_ba = if (max_ba <= 1.5) ba_raw * 100 else ba_raw
  )

# 5. Build terciles (Low / Medium / High)
cd_edu_summary <- cd_edu_summary |>
  mutate(
    edu_tercile_num = ntile(pct_ba, 3),
    edu_tercile = case_when(
      edu_tercile_num == 1 ~ "Low",
      edu_tercile_num == 2 ~ "Medium",
      TRUE                ~ "High"
    ),
    edu_tercile = factor(edu_tercile, levels = c("Low", "Medium", "High"))
  ) |>
  select(cd_id, pct_ba, edu_tercile)

Treating EA as a fixed baseline is a critical research control, isolating market demand from potential confounders. Updating education data post-COVID would conflate two sources of price change: shifting preferences and shifting populations. Holding education constant at 2019 levels therefore ensures that results measure changing housing demand against a fixed set of neighborhood characteristics.

Table 3 below demonstrates this fixed baseline strategy, showing that the same 2019 ACS education data is consistently applied across all six research years (2017-2023), enabling clean isolation of pandemic-era market shifts.

Show code
#| label: integrated-panel-coverage
#| echo: false
#| message: false
#| warning: false

# Create integrated table showing sales years matched to fixed ACS baseline
integrated_coverage <- cd_sales_panel |>
  group_by(year) |>
  summarise(
    n_cds = n_distinct(cd_id),
    .groups = "drop"
  ) |>
  # Add period labels
  mutate(
    period = case_when(
      year %in% c(2017, 2018, 2019) ~ "Pre-COVID",
      year %in% c(2021, 2022, 2023) ~ "Post-COVID",
      TRUE ~ NA_character_
    )
  ) |>
  # Add ACS baseline year (constant 2019)
  mutate(acs_baseline = 2019) |>
  # Reorder columns
  select(
    Year = year,
    Period = period,
    `ACS Baseline` = acs_baseline,
    CDs = n_cds
  ) |>
  arrange(Year)

kable(
  integrated_coverage,
  caption = "**Table 3: Sales Years Matched to Fixed ACS Baseline (2019)**",
  align = c("l", "l", "l", "l")
)
Table 3: Sales Years Matched to Fixed ACS Baseline (2019)
Year Period ACS Baseline CDs
2017 Pre-COVID 2019 59
2018 Pre-COVID 2019 59
2019 Pre-COVID 2019 59
2021 Post-COVID 2019 59
2022 Post-COVID 2019 59
2023 Post-COVID 2019 59
Show code
# Structural validation checks (silent)
stopifnot(
  # Sales years per period
  identical(
    sort(unique(cd_sales_panel$year[cd_sales_panel$period == "pre_covid"])),
    PERIOD_DEFINITIONS$pre_covid$sales_years
  ),
  identical(
    sort(unique(cd_sales_panel$year[cd_sales_panel$period == "post_covid"])),
    PERIOD_DEFINITIONS$post_covid$sales_years
  ),
  # ACS end-years per period
  identical(
    sort(unique(cd_education_panel$acs_year[cd_education_panel$period == "pre_covid"])),
    PERIOD_DEFINITIONS$pre_covid$acs_end_year
  ),
  identical(
    sort(unique(cd_education_panel$acs_year[cd_education_panel$period == "post_covid"])),
    PERIOD_DEFINITIONS$post_covid$acs_end_year
  )
)

Temporal Scope and Final Integration

The core of this analysis compares two three-year periods: pre-COVID (2017-2019) and post-COVID (2021-2023).

Show code
#| label: temporal-coverage-summary
#| echo: false

temporal_scope <- tibble(
  Component = c(
    "Pre-COVID sales period",
    "Post-COVID sales period", 
    "Excluded year",
    "Education data (baseline)"
  ),
  Detail = c(
    "2017, 2018, 2019 (3 years)",
    "2021, 2022, 2023 (3 years)",
    "2020 (pandemic disruption)",
    "ACS 2015–2019 (5-year), used as baseline for both periods"
  )
)

kable(
  temporal_scope,
  caption = "**Table 4: Temporal Scope of Analysis**",
  col.names = c("Component", "Detail"),
  align = c("l", "l")
)
Table 4: Temporal Scope of Analysis
Component Detail
Pre-COVID sales period 2017, 2018, 2019 (3 years)
Post-COVID sales period 2021, 2022, 2023 (3 years)
Excluded year 2020 (pandemic disruption)
Education data (baseline) ACS 2015–2019 (5-year), used as baseline for both periods

Omitting 2020 is essential to the integrity of the analysis. Given that year's acute shocks, its statistical anomalies could distort long-term trend estimates; excluding it yields a clearer view of the market's post-COVID response.

The final data integration shown in Table 5 confirms critical data merging (e.g., median prices by CD-year) with education baselines held constant at 2019 levels.

Show code
#| label: merge-sales-education
#| include: false

# Merge sales + education (reuse cd_sales_panel from Chunk 5)
cd_panel <- cd_sales_panel |>
  left_join(
    cd_education_panel |>
      select(cd_id, period, pct_ba_plus, acs_year),
    by = c("cd_id", "period")
  )

# Validate merge
stopifnot(
  "Merge failed - row count mismatch" = nrow(cd_panel) == nrow(cd_sales_panel),
  "Missing education data after merge" = sum(is.na(cd_panel$pct_ba_plus)) == 0L
)
Show code
#| label: merged-panel-diagnostics
#| include: false
#| message: false
#| warning: false

# CD-period diagnostics
edu_checks <- cd_panel |>
  group_by(cd_id, period) |>
  summarise(
    n_years = n_distinct(year),
    n_rows = n(),
    n_edu_values = n_distinct(pct_ba_plus),
    n_acs_years = n_distinct(acs_year),
    .groups = "drop"
  )

# Compact summary
edu_checks_summary <- edu_checks |>
  summarise(
    min_years = min(n_years),
    max_years = max(n_years),
    min_edu_values = min(n_edu_values),
    max_edu_values = max(n_edu_values),
    min_acs_years = min(n_acs_years),
    max_acs_years = max(n_acs_years)
  )

kable(
  edu_checks_summary,
  caption = "**Table 5: Merged Panel Diagnostics**",
  col.names = c(
    "Min Years (CD Period)", "Max Years (CD Period)",
    "Min BA+ Values", "Max BA+ Values",
    "Min ACS Years", "Max ACS Years"
  ),
  align = rep("l", 6)
)
Table 5: Merged Panel Diagnostics
Min Years (CD Period) Max Years (CD Period) Min BA+ Values Max BA+ Values Min ACS Years Max ACS Years
3 3 1 1 1 1
Show code
# Validation assertions (silent)
stopifnot(
  "Each CD–period should have 3 years" = all(edu_checks$n_years == 3L),
  "Each CD–period should have constant pct_ba_plus" = all(edu_checks$n_edu_values == 1L),
  "Each CD–period should have constant acs_year" = all(edu_checks$n_acs_years == 1L)
)

The approaches highlighted in the Data Acquisition and Processing section create a robust panel of 354 CD-year observations: 59 CDs observed across six years (59 × 6 = 354). This structure facilitates this research's Difference-in-Differences (DiD) analysis, ensuring that any identified trends accurately link to post-pandemic shifts.

Pre-COVID Analytical Framework

Creating the Analysis Set

This baseline analysis addresses the OQ by establishing EA as a strong neighborhood predictor pre-pandemic. Quantifying this “education premium” establishes the context for the study to determine whether the pandemic weakened or strengthened the link between EA and property values.

To analyze how different CDs responded to the pandemic, this research applies a non-parametric tercile grouping, stratifying CDs by EA into "Low," "Medium," and "High" tiers to mitigate outlier effects when comparing distributions.
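The stratification step can be sketched in a few lines of base R. The BA+ shares below are toy values for illustration, not the actual CD data:

```r
# Toy BA+ shares for nine hypothetical districts
ba_pct <- c(12, 18, 25, 30, 35, 41, 48, 60, 82)

# Tercile breaks at the 33rd and 67th percentiles
breaks <- quantile(ba_pct, probs = c(0, 1/3, 2/3, 1))

# Assign each district to Low / Medium / High
tercile <- cut(ba_pct, breaks = breaks,
               labels = c("Low", "Medium", "High"),
               include.lowest = TRUE)

table(tercile)  # 3 districts per group
```

With 59 CDs the same `quantile()` + `cut()` pattern yields groups of 20/19/20, since 59 is not divisible by three.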

Table 6 shows clear delineation: Low-education CDs average 19.3% BA+ Attainment, while High-education CDs average 59.6%. This variation establishes a tangible baseline for comparing how CD housing markets evolved during the pandemic era.

Show code
#| label: create-analysis-dataset
#| code-fold: true
#| code-summary: "Show code: Create analysis dataset with terciles and price changes"
#| output: false

# Chunk 8 - Create Analysis Dataset

# STEP 1: Collapse 354 CD-year observations to 59 CD-level averages
# This averaging reduces year-to-year noise and creates structure for DiD analysis
cd_analysis_simple <- cd_sales_panel |>
  mutate(
    # Simplified period labels for pivot_wider operation
    period_group = case_when(
      period == "pre_covid"  ~ "pre",
      period == "post_covid" ~ "post",
      TRUE ~ NA_character_
    )
  ) |>
  # Calculate average prices and total sales by CD and period
  group_by(cd_id, period_group) |>
  summarise(
    avg_median_price = mean(median_price, na.rm = TRUE),
    avg_mean_price   = mean(mean_price,   na.rm = TRUE),
    total_sales      = sum(n_sales,       na.rm = TRUE),
    .groups          = "drop"
  ) |>
  # Pivot to wide format: one row per CD with pre/post columns
  pivot_wider(
    names_from  = period_group,
    values_from = c(avg_median_price, avg_mean_price, total_sales),
    names_glue  = "{.value}_{period_group}"
  ) |>
  # STEP 2: Calculate price changes in both dollar and log terms
  mutate(
    price_change_dollars = avg_median_price_post - avg_median_price_pre,
    price_change_pct = (price_change_dollars / avg_median_price_pre) * 100,
    # Log transformation for percentage-based interpretation
    log_price_change     = log(avg_median_price_post) - log(avg_median_price_pre),
    log_price_change_pct = log_price_change * 100
  )

# STEP 3: Add baseline education data (2019 ACS, held constant)
cd_analysis_simple <- cd_analysis_simple |>
  left_join(
    cd_education_panel |>
      filter(period == "pre_covid") |>
      select(
        cd_id,
        pct_ba_plus_2019      = pct_ba_plus,
        total_pop_25plus_2019 = total_pop_25plus,
        ba_plus_2019          = ba_plus
      ),
    by = "cd_id"
  )

# STEP 4: Create education terciles (divide 59 CDs into 3 equal groups)
# Calculate tercile breaks at 33rd and 67th percentiles
education_tercile_breaks <- quantile(
  cd_analysis_simple$pct_ba_plus_2019,
  probs = c(0, 1/3, 2/3, 1),
  na.rm = TRUE
)

cd_analysis_simple <- cd_analysis_simple |>
  mutate(
    # Assign each CD to Low/Medium/High based on tercile breaks
    edu_tercile = cut(
      pct_ba_plus_2019,
      breaks = education_tercile_breaks,
      labels = c("Low", "Medium", "High"),
      include.lowest = TRUE
    ),
    # Extract borough from CD ID (first 2 characters)
    borough = case_when(
      substr(cd_id, 1, 2) == "MN" ~ "Manhattan",
      substr(cd_id, 1, 2) == "BK" ~ "Brooklyn",
      substr(cd_id, 1, 2) == "BX" ~ "Bronx",
      substr(cd_id, 1, 2) == "QN" ~ "Queens",
      substr(cd_id, 1, 2) == "SI" ~ "Staten Island",
      TRUE                        ~ NA_character_
    )
  )

# STEP 5: Rename columns for clarity in subsequent analysis
cd_analysis_simple <- cd_analysis_simple |>
  rename(
    price_pre    = avg_median_price_pre,
    price_post   = avg_median_price_post,
    n_sales_pre  = total_sales_pre,
    n_sales_post = total_sales_post
  )
Show code
#| label: analysis-dataset-validation
#| echo: false

# Education tercile ranges
tercile_summary <- cd_analysis_simple |>
  group_by(edu_tercile) |>
  summarise(
    n_cds         = n(),
    min_ba_plus   = min(pct_ba_plus_2019, na.rm = TRUE),
    max_ba_plus   = max(pct_ba_plus_2019, na.rm = TRUE),
    mean_ba_plus  = mean(pct_ba_plus_2019, na.rm = TRUE),
    median_ba_plus = median(pct_ba_plus_2019, na.rm = TRUE),
    .groups       = "drop"
  ) |>
  mutate(
    min_ba_plus    = round(min_ba_plus, 1),
    max_ba_plus    = round(max_ba_plus, 1),
    mean_ba_plus   = round(mean_ba_plus, 1),
    median_ba_plus = round(median_ba_plus, 1)
  )

kable(
  tercile_summary,
  caption = "**Table 6. Education Tercile Ranges**",
  col.names = c(
    "Education Group",
    "Number of CDs",
    "Min BA+ (%)",
    "Max BA+ (%)",
    "Mean BA+ (%)",
    "Median BA+ (%)"
  ),
  align = c("l", "l", "l", "l", "l", "l")
)
Table 6. Education Tercile Ranges
Education Group Number of CDs Min BA+ (%) Max BA+ (%) Mean BA+ (%) Median BA+ (%)
Low 20 11.7 27.3 19.3 19.4
Medium 19 28.5 40.2 34.1 34.5
High 20 40.5 82.5 59.6 52.8

This tercile structure enables parallel-trends testing.

Note: The 40.3 percentage point (pp) gap between the High and Low tercile means (59.6% − 19.3%) will serve as an internal consistency check in the later regression analysis.

Pre-Trend Diagnostics

A DiD design helps filter out confounding factors (e.g., citywide economic shifts and time-invariant neighborhood characteristics), allowing a strict focus on the pandemic's effect. Before interpreting DiD estimates as genuine structural shifts in market behavior, this research first tests the Parallel-Trends Assumption (PTA) to rule out pre-existing divergent trends.
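In its simplest two-group, two-period form (the standard textbook formulation, applied here with the High- and Low-education terciles as comparison groups), the DiD estimand is:

\[\hat{\delta}_{\text{DiD}} = \left(\bar{Y}^{\text{High}}_{\text{post}} - \bar{Y}^{\text{High}}_{\text{pre}}\right) - \left(\bar{Y}^{\text{Low}}_{\text{post}} - \bar{Y}^{\text{Low}}_{\text{pre}}\right)\]

where \(\bar{Y}\) is the group's average log median sale price in a period. The PTA requires that, absent the pandemic, the two groups' prices would have evolved in parallel, so that the second difference isolates the differential pandemic effect.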

Logarithmic Transformation

To make this comparison meaningful, the analysis applies a logarithmic transformation to property values. Because High-EA CDs start from much higher baselines, comparing raw dollar changes would obscure relative trends. Log prices yield relative appreciation rates that are comparable across terciles; for example, a 0.10 log-point shift corresponds to roughly a 10% change in value.
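A quick numerical check of this rule of thumb, in base R:

```r
# A log-point change of d implies an exact percentage change of (e^d - 1) * 100
log_change <- 0.10
pct_change <- (exp(log_change) - 1) * 100
round(pct_change, 1)  # about 10.5%, close to the 10% rule of thumb
```

The approximation is tight for small changes and drifts for larger ones (e.g., a 0.50 log-point change is about a 65% increase, not 50%).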

Figure 1 shows that High-EA CDs start at roughly twice the price level of Low-education districts. The three trajectories nevertheless rise at broadly similar rates, supporting the PTA.

Show code
#| label: fig-pretrend-diagnostic
#| echo: false
#| message: false
#| warning: false
#| fig-width: 8
#| fig-height: 5

# STEP 1: Prepare pre-trend data for visualization
# Filter to pre-COVID years only and join tercile assignments
pretrend_data <- cd_sales_panel |>
  filter(period == "pre_covid") |>
  left_join(
    cd_analysis_simple |>
      select(cd_id, edu_tercile),
    by = "cd_id"
  ) |>
  # STEP 2: Calculate average log price by tercile and year
  # Log transformation normalizes price differences across terciles
  group_by(edu_tercile, year) |>
  summarise(
    avg_log_price = mean(log(median_price), na.rm = TRUE),
    n_cds         = n(),
    .groups       = "drop"
  ) |>
  filter(!is.na(edu_tercile))

# STEP 3: Create parallel trends visualization
ggplot(
  pretrend_data, 
  aes(x = year, y = avg_log_price, color = edu_tercile, group = edu_tercile)
) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_color_manual(
    values = c("Low" = "#d95f02", "Medium" = "#7570b3", "High" = "#1b9e77"),
    name   = "Education Level"
  ) +
  scale_x_continuous(breaks = c(2017, 2018, 2019)) +
  labs(
    title    = "Figure 1: Pre-COVID Price Trends by Education Level (2017–2019)",
    subtitle = "Average log median sale price by education tercile",
    x        = "Year",
    y        = "Log Median Sale Price"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position   = "bottom",
    plot.title        = element_text(face = "bold", size = 14),
    plot.subtitle     = element_text(size = 11),
    panel.grid.minor  = element_blank(),
    panel.grid.major.x = element_blank()
  )

Table 7 below summarizes pre-COVID growth by tercile, with annual rates ranging from 3.0% to 8.2%.

Show code
#| label: pretrend-slopes
#| echo: false

# Calculate trend slopes for each education tercile
# Slope represents annual log-point growth rate (2017-2019)
pretrend_slopes <- pretrend_data |>
  group_by(edu_tercile) |>
  summarise(
    # Extract slope coefficient from simple linear regression
    slope = coef(lm(avg_log_price ~ year))[2],
    .groups = "drop"
  ) |>
  mutate(
    # Convert log-point slope to annualized percentage growth
    # Formula: (exp(slope) - 1) * 100
    annual_pct_growth = round((exp(slope) - 1) * 100, 1),
    slope = round(slope, 3)
  )

kable(
  pretrend_slopes,
  caption = "**Table 7: Pre-COVID Annual Price Growth by Education Group**",
  col.names = c(
    "Education Group",
    "Log-Point Slope",
    "Annual % Growth"
  ),
  align = c("l", "l", "l")
)
Table 7: Pre-COVID Annual Price Growth by Education Group
Education Group Log-Point Slope Annual % Growth
Low 0.079 8.2
Medium 0.030 3.0
High 0.043 4.4
Show code
# Calculate maximum slope difference for parallel trends test
# Small difference (<0.05) supports parallel trends assumption
max_diff <- max(pretrend_slopes$slope) - min(pretrend_slopes$slope)

The small maximum inter-group slope difference (0.049 log points) satisfies the PTA threshold, supporting the interpretation of post-pandemic changes as structural shifts rather than continuations of pre-existing trajectories.

Education and Value: A Lesson on Geography

NYC’s exceptionally diverse EA landscape requires establishing baseline disparities before testing pandemic impacts, as shown in Table 8 below.

Show code
#| label: descriptive-statistics
#| echo: false
#| message: false
#| warning: false

# Education distribution summary
edu_distribution <- cd_analysis_simple |>
  summarise(
    n_cds          = n(),
    mean_ba_plus   = mean(pct_ba_plus_2019, na.rm = TRUE),
    sd_ba_plus     = sd(pct_ba_plus_2019, na.rm = TRUE),
    min_ba_plus    = min(pct_ba_plus_2019, na.rm = TRUE),
    q25_ba_plus    = quantile(pct_ba_plus_2019, 0.25, na.rm = TRUE),
    median_ba_plus = median(pct_ba_plus_2019, na.rm = TRUE),
    q75_ba_plus    = quantile(pct_ba_plus_2019, 0.75, na.rm = TRUE),
    max_ba_plus    = max(pct_ba_plus_2019, na.rm = TRUE)
  ) |>
  mutate(
    across(-n_cds, ~ paste0(round(.x, 1), "%"))
  )

kable(
  edu_distribution,
  caption = "**Table 8: Distribution of EA Across NYC CDs (2019)**",
  col.names = c(
    "CDs",
    "Mean",
    "SD",
    "Min",
    "25th %ile",
    "Median",
    "75th %ile",
    "Max"
  ),
  align = c("l", rep("l", 7))
)
Table 8: Distribution of EA Across NYC CDs (2019)
CDs Mean SD Min 25th %ile Median 75th %ile Max
59 37.7% 19.9% 11.7% 24.4% 34.5% 43.9% 82.5%

EA varies widely across CDs (SD ≈ 19.9 pp). The 19.5 pp gap between the 25th and 75th percentiles supports tercile grouping, which is robust to extreme values at either end of the distribution.

Table 9 below further highlights these disparities, with BA+ Attainment ranging from 11.7% (BX01, in the South Bronx) to 82.5% (MN05, Midtown/Upper East Side), a gap of roughly 71 pp.

Show code
#| label: top-bottom-cds
#| echo: false

# Top and bottom 3 CDs by education
top3 <- cd_analysis_simple |>
  arrange(desc(pct_ba_plus_2019)) |>
  slice_head(n = 3) |>
  select(cd_id, borough, pct_ba_plus_2019) |>
  mutate(rank = "Top 3")

bottom3 <- cd_analysis_simple |>
  arrange(pct_ba_plus_2019) |>
  slice_head(n = 3) |>
  select(cd_id, borough, pct_ba_plus_2019) |>
  mutate(rank = "Bottom 3")

top_bottom_tbl <- bind_rows(top3, bottom3) |>
  relocate(rank) |>
  mutate(
    pct_ba_plus_2019 = paste0(round(pct_ba_plus_2019, 1), "%")
  )

kable(
  top_bottom_tbl,
  caption = "**Table 9: Community Districts with Highest and Lowest Educational Attainment (2019)**",
  col.names = c("Rank", "CD ID", "Borough", "BA+ Attainment"),
  align = c("l", "l", "l", "l")
)
Table 9: Community Districts with Highest and Lowest Educational Attainment (2019)
Rank CD ID Borough BA+ Attainment
Top 3 MN05 Manhattan 82.5%
Top 3 MN01 Manhattan 81.5%
Top 3 MN08 Manhattan 81.1%
Bottom 3 BX01 Bronx 11.7%
Bottom 3 BX05 Bronx 12.2%
Bottom 3 BX06 Bronx 12.9%

The interactive Leaflet map below highlights EA disparities across CDs. Clicking an individual CD reveals its BA+ Attainment percentage.

Show code
#| label: fig-education-map
#| echo: false
#| fig-width: 8
#| fig-height: 6

library(leaflet)
library(sf)
library(htmltools)

# Join education to CD polygons and add neighborhood names
cd_shp_edu <- nyc_cd |>
  left_join(
    cd_analysis_simple |>
      select(cd_id, pct_ba_plus_2019),
    by = "cd_id"
  ) |>
  mutate(
    # Extract neighborhood name from shapefile (it's in the BoroCD column typically)
    # The NYC CD shapefile usually has these - adjust if your column name differs
    neighborhood = case_when(
      # MANHATTAN
      cd_id == "MN01" ~ "Lower Manhattan/Financial District",
      cd_id == "MN02" ~ "Greenwich Village/SoHo",
      cd_id == "MN03" ~ "Lower East Side/Chinatown",
      cd_id == "MN04" ~ "Chelsea/Clinton/Midtown",
      cd_id == "MN05" ~ "Midtown/Upper East Side",
      cd_id == "MN06" ~ "Stuyvesant Town/Turtle Bay",
      cd_id == "MN07" ~ "Upper West Side/Lincoln Square",
      cd_id == "MN08" ~ "Upper East Side/Yorkville",
      cd_id == "MN09" ~ "Morningside Heights/Hamilton Heights",
      cd_id == "MN10" ~ "Central Harlem",
      cd_id == "MN11" ~ "East Harlem",
      cd_id == "MN12" ~ "Washington Heights/Inwood",
      
      # BRONX
      cd_id == "BX01" ~ "Mott Haven/Hunts Point (South Bronx)",
      cd_id == "BX02" ~ "Longwood/Hunts Point",
      cd_id == "BX03" ~ "Morrisania/Crotona Park",
      cd_id == "BX04" ~ "Highbridge/Concourse",
      cd_id == "BX05" ~ "Fordham/University Heights",
      cd_id == "BX06" ~ "East Tremont/Belmont",
      cd_id == "BX07" ~ "Kingsbridge Heights/Bedford Park",
      cd_id == "BX08" ~ "Riverdale/Fieldston",
      cd_id == "BX09" ~ "Parkchester/Soundview",
      cd_id == "BX10" ~ "Throgs Neck/Co-op City",
      cd_id == "BX11" ~ "Morris Park/Pelham Parkway",
      cd_id == "BX12" ~ "Williamsbridge/Wakefield",
      
      # BROOKLYN
      cd_id == "BK01" ~ "Williamsburg/Greenpoint",
      cd_id == "BK02" ~ "Downtown Brooklyn/Fort Greene",
      cd_id == "BK03" ~ "Bedford-Stuyvesant",
      cd_id == "BK04" ~ "Bushwick",
      cd_id == "BK05" ~ "East New York/Starrett City",
      cd_id == "BK06" ~ "Park Slope/Carroll Gardens",
      cd_id == "BK07" ~ "Sunset Park/Windsor Terrace",
      cd_id == "BK08" ~ "Crown Heights/Prospect Heights",
      cd_id == "BK09" ~ "South Crown Heights/Wingate",
      cd_id == "BK10" ~ "Bay Ridge/Dyker Heights",
      cd_id == "BK11" ~ "Bensonhurst/Bath Beach",
      cd_id == "BK12" ~ "Borough Park/Kensington",
      cd_id == "BK13" ~ "Coney Island/Brighton Beach",
      cd_id == "BK14" ~ "Flatbush/Midwood",
      cd_id == "BK15" ~ "Sheepshead Bay/Gravesend",
      cd_id == "BK16" ~ "Brownsville/Ocean Hill",
      cd_id == "BK17" ~ "East Flatbush/Farragut",
      cd_id == "BK18" ~ "Canarsie/Flatlands",
      
      # QUEENS
      cd_id == "QN01" ~ "Astoria/Long Island City",
      cd_id == "QN02" ~ "Sunnyside/Woodside",
      cd_id == "QN03" ~ "Jackson Heights/East Elmhurst",
      cd_id == "QN04" ~ "Elmhurst/Corona",
      cd_id == "QN05" ~ "Ridgewood/Maspeth",
      cd_id == "QN06" ~ "Rego Park/Forest Hills",
      cd_id == "QN07" ~ "Flushing/Whitestone",
      cd_id == "QN08" ~ "Fresh Meadows/Hillcrest",
      cd_id == "QN09" ~ "Kew Gardens/Woodhaven",
      cd_id == "QN10" ~ "South Ozone Park/Howard Beach",
      cd_id == "QN11" ~ "Bayside/Little Neck",
      cd_id == "QN12" ~ "Jamaica/Hollis",
      cd_id == "QN13" ~ "Queens Village/Cambria Heights",
      cd_id == "QN14" ~ "Rockaway/Broad Channel",
      
      # STATEN ISLAND
      cd_id == "SI01" ~ "North Shore (St. George/Stapleton)",
      cd_id == "SI02" ~ "Mid-Island (New Springville)",
      cd_id == "SI03" ~ "South Shore (Tottenville/Great Kills)",
      
      TRUE ~ cd_id  # Fallback to CD ID if not matched
    ),
    
    # Pre-format display values
    ba_display = paste0(round(pct_ba_plus_2019, 1), "%"),
    
    # Create enhanced popup HTML with neighborhood name
    popup_html = paste0(
      "<div style='font-family: Arial, sans-serif;'>",
      "<strong style='font-size: 16px; color: #2c3e50;'>", cd_id, "</strong><br/>",
      "<span style='font-size: 13px; color: #7f8c8d;'>", neighborhood, "</span><br/><br/>",
      "<strong>BA+ Attainment:</strong> ", ba_display,
      "</div>"
    ),
    
    # Create hover label with neighborhood name
    hover_label = paste0(cd_id, ": ", neighborhood)
  )

# Transform to WGS84
cd_shp_edu_wgs84 <- st_transform(cd_shp_edu, 4326)

# Create color palette
pal <- colorNumeric(
  palette = "viridis",
  domain = cd_shp_edu_wgs84$pct_ba_plus_2019
)

# Create title as HTML
map_title <- tags$div(
  style = "text-align: center; padding: 10px; background-color: white; border-bottom: 2px solid #ddd;",
  tags$h3(
    style = "margin: 0; font-family: Arial, sans-serif; color: #2c3e50;",
    "Educational Attainment by Community District (2019)"
  ),
  tags$p(
    style = "margin: 5px 0 0 0; font-size: 13px; color: #7f8c8d;",
    "Percent of adults age 25+ with Bachelor's degree or higher | ACS 2015-2019 (5-year)"
  )
)

# Create leaflet map with title
map <- leaflet(cd_shp_edu_wgs84, width = "100%", height = "600px") |>
  addProviderTiles(providers$CartoDB.Positron) |>
  addPolygons(
    fillColor = ~pal(pct_ba_plus_2019),
    fillOpacity = 0.7,
    color = "white",
    weight = 2,
    opacity = 1,
    popup = ~popup_html,
    highlightOptions = highlightOptions(
      weight = 3,
      color = "#666",
      fillOpacity = 0.9,
      bringToFront = TRUE
    ),
    label = ~hover_label,  # e.g., "MN07: Upper West Side/Lincoln Square"
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "12px",
      direction = "auto"
    )
  ) |>
  addLegend(
    pal = pal,
    values = ~pct_ba_plus_2019,
    title = "<strong>BA+ Attainment</strong>",
    position = "bottomright",
    opacity = 0.9,
    labFormat = labelFormat(suffix = "%", transform = function(x) round(x, 0))
  )

# Add title using htmlwidgets
htmlwidgets::prependContent(map, map_title)

Educational Attainment by Community District (2019)

Percent of adults age 25+ with Bachelor's degree or higher | ACS 2015-2019 (5-year)

The Education Premium

Pre-COVID, there was a strong linear correlation (r = 0.77) between a CD's educational attainment and its property values, as shown in Figure 2 below.

Show code
#| label: baseline-scatter
#| echo: false
#| fig-width: 8
#| fig-height: 5

# Calculate correlation for caption
baseline_cor <- cor(
  cd_analysis_simple$pct_ba_plus_2019,
  cd_analysis_simple$price_pre,
  use = "complete.obs"
)

ggplot(
  cd_analysis_simple,
  aes(x = pct_ba_plus_2019, y = price_pre)
) +
  geom_point(aes(color = borough), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black", linewidth = 1) +
  scale_y_continuous(
    labels = dollar_format(scale = 1e-3, suffix = "K"),
    name   = "Average Median Sale Price (2017–2019)"
  ) +
  scale_x_continuous(
    labels = percent_format(scale = 1),
    name   = "Adults with BA+ (2019)"
  ) +
  scale_color_brewer(palette = "Set2", name = "Borough") +
  labs(
    title    = "Figure 2: Pre-COVID Baseline: Education and Property Values",
    subtitle = paste0("Correlation: r = ", round(baseline_cor, 2))
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "bottom",
    plot.title      = element_text(face = "bold", size = 14),
    plot.subtitle   = element_text(size = 11)
  )

The simple linear regression below models this premium.

\[\text{Median Price}_{\text{pre}} = \beta_0 + \beta_1 \times \text{BA+\%}_{2019} + \epsilon\]

Table 10 presents the full regression results, showing that each additional pp of BA+ Attainment predicts a $13,842 higher median sale price, a statistically significant effect (α = 0.05, p < 0.001).

Show code
#| label: baseline-regression
#| echo: false

# STEP 1: Estimate baseline regression (Pre-COVID)
# Model: Median Price = β₀ + β₁(BA+ %) + ε
baseline_model <- lm(price_pre ~ pct_ba_plus_2019, data = cd_analysis_simple)

# Extract coefficients for use in narrative text
baseline_slope <- coef(baseline_model)[2]
baseline_intercept <- coef(baseline_model)[1]
slope_thousands <- round(baseline_slope / 1000, 1)  # Convert to thousands for display

# STEP 2: Create clean regression table using broom package
library(broom)

regression_table <- tidy(baseline_model, conf.int = TRUE) |>
  mutate(
    # Replace generic term names with descriptive labels
    term = c("Intercept", "BA+ Attainment (%)"),
    # Format numeric columns as currency/thousands
    across(c(estimate, std.error, conf.low, conf.high), ~comma(round(.x, 0))),
    # Round test statistics
    statistic = round(statistic, 2),
    # Format p-values (show "< 0.001" for very small values)
    p.value = ifelse(p.value < 0.001, "< 0.001", round(p.value, 3))
  ) |>
  select(
    Term = term,
    Coefficient = estimate,
    `Std. Error` = std.error,
    `95% CI Lower` = conf.low,
    `95% CI Upper` = conf.high,
    `t-statistic` = statistic,
    `p-value` = p.value
  )

# Display regression coefficients table
kable(
  regression_table,
  caption = "**Table 10: Pre-COVID Baseline Regression: Median Sale Price on Educational Attainment**",
  align = c("l", rep("l", 6))
)
Table 10: Pre-COVID Baseline Regression: Median Sale Price on Educational Attainment
Term Coefficient Std. Error 95% CI Lower 95% CI Upper t-statistic p-value
Intercept 234,656 63,274 107,952 361,361 3.71 < 0.001
BA+ Attainment (%) 13,842 1,486 10,866 16,818 9.31 < 0.001

Consequently, the full regression equation appears as:

\[\widehat{\text{Median Price}}_{\text{Pre-COVID}} = \$234{,}656 + \$13{,}842 \times \text{BA+\%}\]
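Plugging the Table 10 coefficients into this equation illustrates the scale of the premium. `predict_price` below is a hypothetical helper for illustration, not part of the project code:

```r
# Fitted coefficients from Table 10 (rounded to whole dollars)
intercept <- 234656
slope     <- 13842   # dollars per percentage point of BA+ attainment

# Hypothetical helper: predicted pre-COVID median price for a given BA+ share
predict_price <- function(pct_ba) intercept + slope * pct_ba

# Predicted price gap between a 60% and a 20% BA+ district:
predict_price(60) - predict_price(20)  # 40 pp x $13,842 = $553,680
```

This roughly half-million-dollar predicted gap between high- and low-education districts is consistent with the raw tercile price levels reported elsewhere in the analysis.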

Moreover, EA explains 60.3% of the variation in property values across CDs (R² = 0.603), as shown in Table 11 below.

Show code
#| label: baseline-model-fit-statistics
#| echo: false

# Extract and display model fit statistics
# R² shows the proportion of variance explained by education
model_fit <- glance(baseline_model) |>
  transmute(
    `R²`          = round(r.squared, 3),
    `Adjusted R²` = round(adj.r.squared, 3),
    `Residual SE` = comma(round(sigma, 0)),
    `F-statistic` = round(statistic, 2),
    `p-value`     = "< 0.001"
  )

kable(
  model_fit,
  caption = "**Table 11: Model Fit Statistics**",
  align = rep("l", 5)
)
Table 11: Model Fit Statistics
R² Adjusted R² Residual SE F-statistic p-value
0.603 0.596 225,270 86.74 < 0.001

Although EA appears as a dominant neighborhood predictor pre-COVID, the analysis below demonstrates a statistically significant disruption to this trend, exposing a sharp post-pandemic reversal in the education premium.


Post-COVID Analysis and Results

The Education Reversal

The post-COVID scatter plot below reveals a weaker but still positive relationship between EA and property values (r = 0.74, down from r = 0.77 pre-COVID).

Show code
#| label: postcovid-scatter
#| echo: false
#| fig-width: 8
#| fig-height: 5

# Correlation between baseline education and post-COVID prices
post_cor <- cor(
  cd_analysis_simple$pct_ba_plus_2019,
  cd_analysis_simple$price_post,
  use = "complete.obs"
)

ggplot(
  cd_analysis_simple,
  aes(x = pct_ba_plus_2019, y = price_post)
) +
  geom_point(aes(color = borough), size = 3, alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE, color = "black", linewidth = 1) +
  scale_y_continuous(
    labels = dollar_format(scale = 1e-3, suffix = "K"),
    name   = "Average Median Sale Price (2021–2023)"
  ) +
  scale_x_continuous(
    labels = percent_format(scale = 1),
    name   = "Adults with BA+ (2019)"
  ) +
  scale_color_brewer(palette = "Set2", name = "Borough") +
  labs(
    title    = "Figure 3: Post-COVID: Education and Property Values",
    subtitle = paste0("Correlation: r = ", round(post_cor, 2))
  ) +
  theme_minimal(base_size = 13) +
  theme(
    legend.position = "bottom",
    plot.title      = element_text(face = "bold", size = 14),
    plot.subtitle   = element_text(size = 11)
  )

While high EA neighborhoods maintain an absolute price advantage, the correlation’s decline suggests the education premium diminished during the pandemic years, leading to the rejection of Hypothesis 1.

The regression line’s flatter slope indicates that each additional pp of BA+ Attainment predicts a smaller price differential than in the pre-COVID period. Table 12 provides additional clarity with actual price changes across terciles, revealing which groups appreciated fastest.

Show code
#| label: main-did-tercile-table
#| echo: false

# Part 1: Tercile summary table
tercile_summary_table <- cd_analysis_simple |>
  group_by(edu_tercile) |>
  summarise(
    n_cds              = n(),
    avg_price_pre      = mean(price_pre,  na.rm = TRUE),
    avg_price_post     = mean(price_post, na.rm = TRUE),
    avg_change_dollars = mean(price_change_dollars, na.rm = TRUE),
    avg_change_pct     = mean(price_change_pct, na.rm = TRUE),
    .groups            = "drop"
  )

# Calculate exact difference BEFORE rounding (Low - High = positive difference)
low_mean <- tercile_summary_table |>
  filter(edu_tercile == "Low") |>
  pull(avg_change_pct)

high_mean <- tercile_summary_table |>
  filter(edu_tercile == "High") |>
  pull(avg_change_pct)

did_estimate_raw <- low_mean - high_mean  # This is the exact 14.17 → 14.2

# Add overall row
overall_row <- cd_analysis_simple |>
  summarise(
    edu_tercile        = "All CDs",
    n_cds              = n(),
    avg_price_pre      = mean(price_pre,  na.rm = TRUE),
    avg_price_post     = mean(price_post, na.rm = TRUE),
    avg_change_dollars = mean(price_change_dollars, na.rm = TRUE),
    avg_change_pct     = mean(price_change_pct,    na.rm = TRUE)
  )

# NOW round for display
tercile_summary_table <- bind_rows(tercile_summary_table, overall_row) |>
  mutate(
    avg_price_pre      = dollar(round(avg_price_pre, 0)),
    avg_price_post     = dollar(round(avg_price_post, 0)),
    avg_change_dollars = dollar(round(avg_change_dollars, 0)),
    avg_change_pct     = paste0(round(avg_change_pct, 2), "%")
  )

kable(
  tercile_summary_table,
  caption = "**Table 12: Pre- and Post-COVID Median Prices and Changes by Education Group**",
  col.names = c(
    "Education Group",
    "CDs",
    "Pre-COVID Median",
    "Post-COVID Median",
    "Change ($)",
    "Change (%)"
  ),
  align = c("l", "l", "l", "l", "l", "l")
)
Table 12: Pre- and Post-COVID Median Prices and Changes by Education Group
Education Group CDs Pre-COVID Median Post-COVID Median Change ($) Change (%)
Low 20 $517,091 $641,946 $124,854 26.03%
Medium 19 $720,373 $813,311 $92,938 15.03%
High 20 $1,031,182 $1,127,420 $96,237 11.86%
All CDs 59 $756,823 $861,699 $104,875 17.68%

Note: Growth rates represent the average of individual CD appreciation rates, not the percent change of the aggregate median.
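The distinction in this note matters numerically. A minimal, self-contained sketch (synthetic prices, not the project data; a pooled total stands in for the aggregate median for simplicity) shows that the mean of CD-level growth rates can differ from the growth of the aggregate:

```r
# Two hypothetical districts with different price levels
pre  <- c(100000, 200000)
post <- c(130000, 220000)

# Average of individual appreciation rates (the quantity Table 12 reports)
mean_of_rates <- mean((post - pre) / pre) * 100          # (30% + 10%) / 2 = 20%

# Percent change of the pooled total
aggregate_rate <- (sum(post) - sum(pre)) / sum(pre) * 100  # 50000 / 300000 ~ 16.7%

cat("Mean of rates:", mean_of_rates, "% vs aggregate:", round(aggregate_rate, 1), "%\n")
```

Because cheaper districts contribute larger percentage gains for the same dollar change, the two summaries diverge whenever growth varies with price level.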


These findings indicate a remarkable reversal of the traditional education premium. From 2017-19 to 2021-23, Low-EA CDs experienced 26.03% median price growth, whereas Medium-EA and High-EA CDs grew by only 15.03% and 11.86%, respectively. Most notably, High-EA CDs grew at less than half the rate of their Low-EA counterparts, a gap of 14.2 percentage points (pp).

This result rejects Hypothesis 2, as Low-EA CDs saw the fastest appreciation. The Leaflet map below highlights this appreciation pattern across all CDs.

Show code
#| label: interactive-growth-map
#| echo: false
#| message: false
#| warning: false
#| fig.width: 10
#| fig.height: 8

library(sf)
library(leaflet)
library(htmltools)

# STEP 1: Create clean 59-row lookup table
hover_lookup <- cd_sales_panel |>
  mutate(
    period2 = case_when(
      period %in% c("pre_covid", "Pre-COVID")   ~ "Pre",
      period %in% c("post_covid", "Post-COVID") ~ "Post",
      TRUE                                      ~ NA_character_
    )
  ) |>
  filter(!is.na(period2)) |>
  group_by(cd_id, period2) |>
  summarise(median_price = median(median_price, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = period2, values_from = median_price) |>
  left_join(cd_edu_summary, by = "cd_id") |>
  mutate(
    dollar_change = Post - Pre,
    pct_change = (Post - Pre) / Pre * 100
  ) |>
  mutate(
    cd_display = cd_id,
    tercile_display = as.character(edu_tercile),
    ba_display = paste0(round(pct_ba, 1), "%"),
    pre_display = dollar(round(Pre, 0)),
    post_display = dollar(round(Post, 0)),
    dollar_display = dollar(round(dollar_change, 0)),
    pct_display = sprintf("%+.1f%%", pct_change)  # signed format avoids "+-" for declining CDs
  ) |>
  select(
    cd_id, edu_tercile,
    cd_display, tercile_display, ba_display,
    pre_display, post_display, dollar_display, pct_display
  )

# STEP 2: Join to spatial data and create HTML popup
cd_map_sf <- nyc_cd |>
  left_join(hover_lookup, by = "cd_id") |>
  filter(!is.na(edu_tercile)) |>
  mutate(
    # Create HTML-formatted popup
    popup_html = paste0(
      "<strong style='font-size: 14px;'>", cd_display, "</strong><br/>",
      "<strong>Tercile:</strong> ", tercile_display, "<br/>",
      "<strong>Education (BA+):</strong> ", ba_display, "<br/>",
      "<br/>",
      "<strong>Pre-COVID Price:</strong> ", pre_display, "<br/>",
      "<strong>Post-COVID Price:</strong> ", post_display, "<br/>",
      "<br/>",
      "<strong>Dollar Change:</strong> ", dollar_display, "<br/>",
      "<strong style='color: #d62728;'>Percent Change:</strong> ", pct_display
    )
  )

# STEP 3: Transform to WGS84
cd_map_wgs84 <- st_transform(cd_map_sf, 4326)

# STEP 4: Create color palette
pal <- colorFactor(
  palette = c("#d62728", "#ff7f0e", "#1f77b4"),
  levels = c("Low", "Medium", "High")
)

# STEP 5: Create title as HTML
map_title <- tags$div(
  style = "text-align: center; padding: 10px; background-color: white; border-bottom: 2px solid #ddd;",
  tags$h3(
    style = "margin: 0; font-family: Arial, sans-serif; color: #2c3e50;",
    "Property Value Growth by Education Tercile (2017–2023)"
  ),
  tags$p(
    style = "margin: 5px 0 0 0; font-size: 13px; color: #7f8c8d;",
    "Click any Community District for detailed price change statistics"
  )
)

# STEP 6: Create leaflet map
map <- leaflet(cd_map_wgs84, width = "100%", height = "600px") |>
  addProviderTiles(providers$CartoDB.Positron) |>
  addPolygons(
    fillColor = ~pal(edu_tercile),
    fillOpacity = 0.7,
    color = "white",
    weight = 2,
    opacity = 1,
    popup = ~popup_html,
    highlightOptions = highlightOptions(
      weight = 3,
      color = "#666",
      fillOpacity = 0.9,
      bringToFront = TRUE
    ),
    label = ~cd_id,
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "12px",
      direction = "auto"
    )
  ) |>
  addLegend(
    pal = pal,
    values = ~edu_tercile,
    title = "<strong>Education Tercile</strong>",
    position = "bottomleft",
    opacity = 0.9
  )

# STEP 7: Add title using htmlwidgets
htmlwidgets::prependContent(map, map_title)

[Interactive map: Property Value Growth by Education Tercile (2017–2023). Click any Community District for detailed price change statistics.]

This reversal shows how Low-EA districts gained nearly $29,000 more than High-EA districts on average. In other words, a home purchased in a traditionally “less desirable” neighborhood in 2017 substantially outperformed its High-EA counterpart in percentage terms by 2023.

The t-test in Table 13 below confirms this pattern.

Table 13: Difference in Average Price Growth Between High-EA and Low-EA CDs
Comparison Difference 95% CI t-statistic df p-value
High − Low -14.2 pp [-22.9, -5.4] -3.27 37 0.002

The difference between the high and low terciles is statistically significant (p = 0.002), with a 95% confidence interval that excludes zero entirely, indicating the pattern is unlikely to have occurred by chance.
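A comparison like Table 13's is a standard two-sample (Welch) t-test on CD-level growth rates. The sketch below uses made-up growth figures, not the project data, purely to illustrate the mechanics:

```r
# Hypothetical appreciation rates (%) for Low-EA and High-EA districts
low_growth  <- c(24, 31, 22, 28, 26, 30, 23, 27)
high_growth <- c(12, 15,  9, 14, 11, 13, 10, 12)

# Welch's t-test does not assume equal variances, which is why Table 13
# reports df = 37 rather than the pooled n1 + n2 - 2 = 38
res <- t.test(high_growth, low_growth)

res$estimate   # group means (High, Low)
res$conf.int   # 95% CI for the difference (High - Low)
res$p.value
```

A negative confidence interval that excludes zero, as in Table 13, indicates High-EA growth reliably below Low-EA growth.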

Figure 4 below shows the profound magnitude of this reversal.

Non-overlapping confidence intervals confirm distinct economic outcomes between the High- and Low-EA groups, indicating the 14.2 pp divergence represents a structural shift rather than an anomaly.

Parametric Regression Analysis

While the tercile comparison demonstrates the magnitude of the reversal, a continuous regression quantifies how each incremental pp of BA+ Attainment influenced post-COVID appreciation.

As shown in Table 14, each additional pp of BA+ Attainment predicts 0.380 pp less price growth. This statistically significant (p < 0.001) negative coefficient directly contradicts the pre-COVID pattern, where higher EA predicted higher prices. The relationship has not only weakened; it has reversed.

Show code
#| label: main-did-regression
#| echo: false

# STEP 1: Estimate post-COVID regression
# Model: Price Change (%) = β₀ + β₁(BA+ %) + ε
simple_lm <- lm(
  price_change_pct ~ pct_ba_plus_2019,
  data = cd_analysis_simple
)

# Extract coefficients for use in narrative text
post_slope <- coef(simple_lm)[2]
post_intercept <- coef(simple_lm)[1]

# STEP 2: Create clean regression table using broom package
regression_table <- tidy(simple_lm, conf.int = TRUE) |>
  mutate(
    # Replace generic term names with descriptive labels
    term = c("Intercept", "BA+ Attainment (%)"),
    # Format numeric columns for display
    across(c(estimate, std.error, conf.low, conf.high), ~round(.x, 3)),
    # Round test statistics
    statistic = round(statistic, 2),
    # Format p-values (show "< 0.001" for very small values)
    p.value = ifelse(p.value < 0.001, "< 0.001", round(p.value, 4))
  ) |>
  select(
    Term = term,
    Coefficient = estimate,
    `Std. Error` = std.error,
    `95% CI Lower` = conf.low,
    `95% CI Upper` = conf.high,
    `t-statistic` = statistic,
    `p-value` = p.value
  )

# Display regression coefficients table
kable(
  regression_table,
  caption = "**Table 14: Post-COVID Regression Price Growth on EA**",
  align = c("l", rep("l", 6))
)
Table 14: Post-COVID Regression Price Growth on EA
Term Coefficient Std. Error 95% CI Lower 95% CI Upper t-statistic p-value
Intercept 32.006 3.340 25.317 38.695 9.58 < 0.001
BA+ Attainment (%) -0.380 0.078 -0.537 -0.223 -4.84 < 0.001

Comparing pre- and post-COVID models reveals this shift’s extent.

In the pre-COVID period, higher education predicted higher absolute prices, with each pp of BA+ Attainment adding nearly $14,000 to median home values.

\[\widehat{\text{Median Price}} = \$235,462 + \$13,816 \times \text{BA+\%}\]

In the post-COVID period, this relationship inverted: higher education predicted slower price appreciation.

\[\widehat{\text{Price Change}} = 32.01\% - 0.380 \times \text{BA+\%}\]
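As a quick worked example of the fitted post-COVID line (coefficients taken from Table 14; the 20% and 60% BA+ values are hypothetical districts chosen for illustration):

```r
# Post-COVID fit from Table 14: growth (%) = 32.006 - 0.380 * BA+%
predict_growth <- function(ba_pct) 32.006 - 0.380 * ba_pct

predict_growth(20)   # low-education CD:  32.006 - 7.6  = 24.406 (% growth)
predict_growth(60)   # high-education CD: 32.006 - 22.8 =  9.206 (% growth)

# A 40-pp education gap translates into a ~15-pp growth gap
predict_growth(20) - predict_growth(60)   # 0.380 * 40 = 15.2
```

This is the same arithmetic the internal consistency check below relies on: slope times education gap equals the predicted growth differential.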

As shown in Table 15 below, the post-COVID regression yields an \(R^2\) of 0.291, indicating that baseline EA alone explains 29.1% of the variation in price appreciation. Although lower than in the pre-COVID model, this is meaningful explanatory power for a growth metric, confirming that EA remained a primary, yet inverted, driver of market disparity during the pandemic.

Show code
#| label: baseline-model-fit-statistics
#| echo: false

# Extract model fit statistics
# R² shows the proportion of variance in price growth explained by education
regression_glance <- glance(simple_lm)

model_fit <- regression_glance |>
  transmute(
    `R²`          = round(r.squared, 3),
    `Adjusted R²` = round(adj.r.squared, 3),
    `Residual SE` = paste0(round(sigma, 2), " pp"),
    `F-statistic` = round(statistic, 2),
    `p-value`     = "< 0.001"
  )

kable(
  model_fit,
  caption = "**Table 15: Model Fit Statistics**",
  align = rep("l", 5)
)
Table 15: Model Fit Statistics
R² Adjusted R² Residual SE F-statistic p-value
0.291 0.279 11.89 pp 23.42 < 0.001

Internal Consistency Check

The tercile-based and regression-based approaches should yield consistent estimates if the education-growth relationship is approximately linear.
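The logic behind this check can be seen in a tiny synthetic example: when growth is an exactly linear function of education, the High-minus-Low difference in group means equals the regression slope times the education gap. (All numbers below are invented for the demonstration.)

```r
# Perfectly linear toy relationship: growth = 30 - 0.4 * ba
ba     <- c(10, 15, 20, 55, 60, 65)          # BA+ % for six toy districts
growth <- 30 - 0.4 * ba                      # exactly linear growth (%)
group  <- rep(c("Low", "High"), each = 3)

slope   <- coef(lm(growth ~ ba))[2]          # recovers -0.4 exactly
edu_gap <- mean(ba[group == "High"]) - mean(ba[group == "Low"])   # 60 - 15 = 45

# Under linearity the two quantities coincide:
mean(growth[group == "High"]) - mean(growth[group == "Low"])   # -18
unname(slope) * edu_gap                                        # -0.4 * 45 = -18
```

With real data the relationship is only approximately linear, so the tercile and regression estimates should be close but not identical, which is exactly what the next comparison tests.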

From Table 14, each additional pp of BA+ Attainment predicts a -0.380 pp change in price growth. Table 6 earlier highlighted the sizable education gap between the High and Low terciles; recomputed on this analysis sample, that gap is 40.4 pp (Table 16).

Using the regression coefficient, it is possible to predict the expected difference:

\[\text{Predicted Difference} = \text{Education Gap} \times \text{Regression Coefficient}\]

\[\text{Predicted Difference} = 40.4 \text{ pp} \times (-0.380) \approx -15.3 \text{ pp}\]

This means the regression model predicts that High-EA CDs should grow about 15.3 pp less than Low-EA CDs.

From Table 12, we observed that Low-EA CDs actually grew 14.2 pp more than High-EA CDs (26.03% - 11.86% = 14.17% ≈ 14.2 pp).

Table 16 compares these two estimates to assess internal consistency.

Show code
#| label: main-did-consistency-check
#| echo: false

# Part 5: Internal consistency check
edu_coef <- coef(simple_lm)[2]

edu_gap <- cd_analysis_simple |>
  group_by(edu_tercile) |>
  summarise(avg_edu = mean(pct_ba_plus_2019, na.rm = TRUE), .groups = "drop") |>
  filter(edu_tercile %in% c("Low", "High")) |>
  summarise(gap = avg_edu[edu_tercile == "High"] - avg_edu[edu_tercile == "Low"]) |>
  pull(gap)

predicted_did <- edu_gap * edu_coef
observed_did <- did_estimate_raw

consistency_table <- tibble(
  Quantity = c(
    "Average education gap (High − Low)",
    "Regression coefficient (pp per 1% BA+)",
    "Predicted difference (High − Low)",
    "Observed difference (High − Low)"
  ),
  Value = c(
    paste0(round(edu_gap, 1), " pp"),
    round(edu_coef, 3),
    paste0(round(predicted_did, 1), " pp"),
    paste0(round(-observed_did, 1), " pp")  # Flip sign for consistency
  )
)

kable(
  consistency_table,
  caption = "**Table 16: Internal Consistency Check: Tercile DiD vs. Continuous Regression**",
  align = c("l", "l")
)
Table 16: Internal Consistency Check: Tercile DiD vs. Continuous Regression
Quantity Value
Average education gap (High − Low) 40.4 pp
Regression coefficient (pp per 1% BA+) -0.38
Predicted difference (High − Low) -15.3 pp
Observed difference (High − Low) -14.2 pp

Note: Both values expressed from High-EA perspective (negative = High grew slower than Low). The close match (−15.3 pp vs. −14.2 pp) confirms internal consistency.


The regression-based prediction (-15.3 pp) closely matches the observed tercile difference (-14.2 pp), both expressed from the High-EA perspective. This close correspondence (a gap of about 1.1 pp, roughly 7% in relative terms) confirms internal consistency between the non-parametric (tercile) and parametric (regression) approaches.

Whether comparing discrete education groups or modeling continuous relationships, the conclusion remains the same: Low-EA CDs experienced substantially faster price growth during the post-COVID period, with the magnitude of this reversal measuring approximately 14-15 pp.


Robustness: Citywide Validation by Borough

The previous sections highlighted stark differences in EA and property values in two boroughs. However, it is essential to examine whether this inverted education-growth relationship holds across all five boroughs, as each contains a different demographic composition and had a unique pandemic experience.

Table 17 reveals that pandemic-era property value growth occurred across all five boroughs, though growth rates varied.

Show code
#| label: robustness-borough-summary
#| echo: false

# Borough-level summary
borough_summary <- cd_analysis_simple |>
  group_by(borough) |>
  summarise(
    n_cds          = n(),
    avg_change_pct = mean(price_change_pct,na.rm = TRUE),
    sd_change_pct  = sd(price_change_pct,  na.rm = TRUE),
    min_change_pct = min(price_change_pct, na.rm = TRUE),
    max_change_pct = max(price_change_pct, na.rm = TRUE),
    .groups        = "drop"
  ) |>
  # Sort on the numeric values BEFORE formatting them as text; arranging after
  # paste0() would sort alphabetically (e.g., "6.4%" would rank above "31.7%")
  arrange(desc(avg_change_pct)) |>
  mutate(
    avg_change_pct = paste0(round(avg_change_pct, 1), "%"),
    sd_change_pct  = paste0(round(sd_change_pct,  1), "%"),
    min_change_pct = paste0(round(min_change_pct, 1), "%"),
    max_change_pct = paste0(round(max_change_pct, 1), "%")
  )

kable(
  borough_summary,
  caption = "**Table 17: Distribution of CD-Level Price Growth by Borough**",
  col.names = c(
    "Borough",
    "CDs",
    "Mean Change",
    "SD",
    "Min",
    "Max"
  ),
  align = c("l", "c", "r", "r", "r", "r")
)
Table 17: Distribution of CD-Level Price Growth by Borough
Borough CDs Mean Change SD Min Max
Bronx 12 31.7% 11.7% 4.8% 51.5%
Staten Island 3 22.6% 1.5% 21.6% 24.3%
Queens 14 16.9% 11.7% 2% 38.2%
Brooklyn 18 15.7% 8.6% 1.1% 39.4%
Manhattan 12 6.4% 15.9% -16.1% 45.7%

However, there is substantial variation between boroughs. Manhattan exhibited the highest volatility (\(SD = 15.9\%\)), Queens showed similar spread (\(SD = 11.7\%\)), while Brooklyn remained relatively consistent (\(SD = 8.6\%\)).

Table 18 quantifies how EA impacts property values within each borough.

Table 18: Education–Growth Relationship by Borough
Borough CDs Correlation (r) Slope (pp per 1% BA+)
Staten Island 3 -0.991 -0.682
Bronx 12 -0.346 -0.476
Manhattan 12 -0.389 -0.298
Queens 14 -0.221 -0.230
Brooklyn 18 -0.009 -0.005

There is a consistent negative relationship between EA and price growth across all five boroughs. All boroughs show negative correlations and negative slopes, confirming the pattern extends beyond any single market.
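Per-borough estimates like those in Table 18 can be produced with a grouped-model sketch. The project's generating code is not shown, so the snippet below is an assumed reconstruction on synthetic data, with hypothetical column names (`borough`, `pct_ba`, `growth`):

```r
library(dplyr)

# Synthetic CD-level data for two hypothetical boroughs, built so that
# borough "A" follows growth = 32 - 0.4*ba and "B" follows growth = 31.5 - 0.1*ba
toy <- tibble(
  borough = rep(c("A", "B"), each = 4),
  pct_ba  = c(10, 20, 30, 40,  15, 25, 35, 45),
  growth  = c(28, 24, 20, 16,  30, 29, 28, 27)
)

# One correlation and one within-borough regression slope per group
borough_slopes <- toy |>
  group_by(borough) |>
  summarise(
    n_cds = n(),
    r     = cor(pct_ba, growth),
    slope = coef(lm(growth ~ pct_ba))[2],
    .groups = "drop"
  )

borough_slopes
```

Fitting a separate `lm()` inside `summarise()` mirrors how the Figure 5 `borough_slopes` object would typically be assembled.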

Staten Island (SI), with its three CDs, exhibits the strongest negative relationship (r = -0.991, slope = -0.682), though with only three districts this estimate rests on very little data.

Brooklyn’s near-zero slope (-0.005) and negligible correlation (-0.009) suggest that other factors (e.g., gentrification) drove appreciation more than EA. Its proximity to Manhattan likely outweighed EA in determining appreciation as well. Consequently, Brooklyn’s unique outcome calls for deeper investigation in future research.

Figure 5 visualizes these borough-specific slopes for direct comparison.

Show code
#| label: robustness-slope-plot
#| echo: false
#| fig-width: 7
#| fig-height: 4

ggplot(
  borough_slopes,
  aes(x = reorder(borough, slope), y = slope)
) +
  geom_col(fill = "#7570b3", width = 0.6, alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(
    title    = "Figure 5: Education–Growth Relationship Varies by Borough",
    subtitle = "Slope: Percentage points of price growth per 1pp increase in BA+ attainment",
    x        = "Borough",
    y        = "Slope (pp per 1% BA+)"
  ) +
  coord_flip() +
  theme_minimal(base_size = 13) +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11),
    panel.grid.major.y = element_blank()
  )

This figure confirms that no borough experienced the traditional positive education-growth relationship during the post-COVID period. In four of five boroughs, the negative relationship was moderate to strong, with slopes ranging from -0.230 (Queens) to -0.682 (Staten Island).

Discussion

Interpretation

Several factors likely explain this borough-wide education premium reversal.

Remote work reduced the need to live near employment-rich, amenity-filled hubs, which are usually concentrated in high-EA CDs. Affordability pressures also pushed buyers toward undervalued outer-borough CDs that offered greater access to homeownership. Lastly, lower-valuation CDs experienced a post-pandemic market correction and catch-up growth (a “donut effect”), shifting market demand and raising prices in once-affordable CDs.

These factors likely worked together as a systemic feedback loop. As the agglomeration premium of high-EA CDs diminished, market demand shifted toward peripheral, or outer-borough markets. This demand surge created a compounding effect, accelerating appreciation in Low-EA districts while High-EA growth stagnated.

It is essential to clarify that this analysis focuses on highlighting a change in growth rather than hierarchy. High-EA neighborhoods maintained their absolute price dominance throughout the pandemic. The reversal occurred primarily in appreciation rates. However, it remains unclear whether these findings reveal a permanent value change or a temporary disruption.

Contribution to Overarching Question

This analysis demonstrates that COVID-19 reshaped the relationship between neighborhood characteristics and property values by reversing the education premium, which was historically a strong predictor of urban real estate prices. Moreover, this finding connects with the team’s individual analyses of other neighborhood characteristics. This research:

  • Aligns with transit findings: Both accessibility and education premiums weakened.

  • Contextualizes density analysis: The apparent density penalty was likely an artifact of education composition rather than density itself.

  • Contrasts with job accessibility: Job accessibility remained stable while high-EA CDs lost their premium and low-EA CDs gained value, showing the education premium reversal.

  • Provides baseline context for crime results: Heterogeneous effects by initial crime conditions mirror our tercile patterns.

The education reversal appears to be the dominant structural shift, with other characteristics showing stability (jobs) or modest weakening (density and transit). This suggests pandemic housing dynamics re-calibrated a longstanding urban economic geography.

Limitations and Conclusion

While this analysis provides clear evidence of a post-pandemic reversal, several limitations inform the results. First, CD-level data may obscure location-specific variation, such as gentrification within Low-EA areas. Second, residents of High-EA CDs may have held onto their properties; low measured growth could therefore reflect thin inventory rather than weak demand, masking a retention premium. Finally, because the data extend only through 2023, it is unclear whether these findings reflect a temporary disruption or a permanent shift, particularly given recent employer mandates calling employees back to physical work locations.

Despite these limitations, this research shows that the pandemic reversed the relationship between neighborhood EA and property value growth in NYC. COVID-19 not only disrupted the logic of NYC property values; it reshaped buyer preferences. Consequently, affordability pressures created a new urban landscape in which traditional high-EA clusters, once a dominant driver of property values, matter less as demand shifts toward the value offered by outer-borough CDs.


References

Data Sources


Methods and Technical References


Background and Context Readings


Peer Project Pages